I want to try: distance, month, dest, precip, temp, visib for all of these, I may want to split by origin. My graph options include histogram, geom_point (scatter), geom_jitter, geom_smooth(method=lm), geom_line for line chart, geom_col for discrete x continuous y, geom_boxplot, geom_count and for one var, geom_hist, geom_density, geom_dotplot.
** Important notes you have never got re summarise. Group by the “background” vars you want to see compared to your dependent var. use the summarize command to perform functions on the dependent var. Don’t group by the dependent var**
*Section One First hypothesis: Distance and arrival delay, let’s see if there is a relationship. REMAINDERS: Charts A-3 and A-5 have wasted space in some of the origin facets. REMAINDERS: What we found would make it interesting to check dep_delay and Distance.
A-1: Let’s look at the data.
unique(flights$arr_delay)
## [1] 11 20 33 -18 -25 12 19 -14 -8 8 -2 -3 7 31
## [15] -4 -7 -6 16 -12 -17 32 14 4 -21 -9 3 5 1
## [29] 29 10 0 -19 48 -5 -10 -11 -33 27 -23 2 -31 44
## [43] -15 21 -26 -1 30 26 49 -30 -22 -13 -28 137 23 17
## [57] -16 -40 24 51 15 -24 -39 -27 851 9 6 50 40 -29
## [71] 28 13 42 39 123 145 78 34 22 38 43 -38 36 81
## [85] 53 93 56 37 73 18 46 60 103 84 66 65 83 127
## [99] 52 125 NA 47 35 115 91 136 67 72 -35 80 59 96
## [113] -34 25 41 -32 68 61 107 138 116 338 94 263 -47 -48
## [127] 54 151 166 174 75 222 45 55 250 142 246 191 69 456
## [141] -20 154 -52 -44 -59 90 -41 -36 -45 171 58 70 -57 57
## [155] 173 62 99 97 85 63 98 102 -55 207 -42 288 323 119
## [169] -37 130 109 368 122 82 131 71 141 120 106 -51 77 158
## [183] 180 64 176 88 182 157 359 89 252 110 100 111 101 172
## [197] 143 -43 92 -46 167 133 -65 -63 113 79 128 -49 -61 175
## [211] -53 -56 -60 270 257 105 285 150 -50 -70 126 161 162 118
## [225] 132 87 276 156 112 248 308 213 160 95 155 74 121 86
## [239] 108 117 -54 168 144 184 147 1272 140 76 124 247 1109 178
## [253] 394 292 348 181 200 245 230 149 177 297 189 215 135 612
## [267] 146 185 134 211 114 228 190 104 243 203 272 -64 328 187
## [281] 258 159 206 497 212 199 139 129 193 235 169 299 271 238
## [295] 221 237 232 265 231 163 164 281 186 153 264 152 192 165
## [309] 236 256 196 262 198 255 201 223 233 183 486 194 205 214
## [323] 282 210 273 340 218 370 330 275 226 188 179 249 241 209
## [337] 364 325 225 267 240 239 197 266 229 278 351 -58 -62 219
## [351] 294 148 217 224 253 220 208 227 335 204 268 195 311 170
## [365] 360 216 301 366 202 312 284 688 314 244 350 374 279 381
## [379] 378 291 417 300 318 796 434 345 344 -67 274 302 385 234
## [393] 269 242 313 309 389 346 329 277 321 290 304 614 296 324
## [407] 280 396 334 681 337 283 326 878 377 357 354 261 443 356
## [421] 260 538 422 361 336 307 376 259 254 322 856 386 298 319
## [435] 648 295 395 493 846 251 372 433 847 362 353 469 310 390
## [449] -68 331 397 384 834 327 595 744 767 289 773 -69 -66 387
## [463] 410 305 375 317 436 415 391 293 380 915 343 316 402 303
## [477] 784 506 315 383 306 332 560 399 373 616 931 347 516 428
## [491] 349 333 286 341 435 783 821 405 408 -73 -75 875 369 -74
## [505] -71 -86 398 453 499 495 852 -79 287 461 407 379 448 598
## [519] 352 451 339 420 421 430 468 496 342 1127 411 489 367 444
## [533] 474 424 769 780 802 850 400 320 431 363 490 485 401 419
## [547] 561 382 676 632 404 445 414 441 475 412 551 371 355 645
## [561] 463 989 406 895 393 458 455 577 454 449 358 403 674 505
## [575] 492 572 571 1007
unique(flights$distance)
## [1] 1400 1416 1089 1576 762 719 1065 229 944 733 1028 1005 2475 2565
## [15] 1389 187 2227 1076 1023 1020 502 1085 760 2586 1074 1598 746 2133
## [29] 1096 1008 2153 2454 185 529 212 950 301 1620 2434 2248 1182 1035
## [43] 1990 1147 997 1069 1068 1608 738 2402 1372 1605 264 1080 937 209
## [57] 1010 1041 2422 544 427 200 541 479 820 1029 628 946 340 2446
## [71] 213 419 888 828 740 550 431 509 184 1874 416 488 2425 711
## [85] 4983 764 1521 266 96 1623 1634 404 228 1726 199 725 1047 583
## [99] 273 872 282 645 284 214 708 397 195 463 246 169 1428 1183
## [113] 2465 533 1411 143 116 1969 319 4963 799 445 963 569 425 1504
## [127] 277 665 292 483 594 605 1092 765 1391 589 335 748 94 1107
## [141] 1626 1747 290 1569 642 2569 1134 636 444 2576 2521 1585 1215 1587
## [155] 631 1325 160 1017 1617 585 1131 618 254 964 647 866 269 1167
## [169] 602 288 80 1728 1882 1795 1207 2378 461 833 258 641 722 659
## [183] 660 549 378 812 296 1417 1113 1148 1031 610 1826 500 198 173
## [197] 651 892 604 637 1725 1746 617 1894 662 305 745 599 865 655
## [211] 563 644 3370 17
flights %>%
group_by(distance, origin) %>%
summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>%
arrange(distance)
Nothing too obvious here. A-2: Let’s check out distance outliers.
flights %>%
ggplot(aes(x = distance)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are some.
A-3 - Let’s facet by origin for our first major two-variable graph.
flights %>%
group_by(distance, origin) %>%
summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>%
arrange(distance) %>%
ggplot(aes(y = meandelayQ101, x = distance)) +
geom_point() +
geom_smooth(method = lm) +
facet_wrap(~ origin, scales = "free_x") ##hari made this change on scales
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
flights
flights %>%
filter(!is.na(distance) & !is.na(air_time)) %>%
group_by(distance) %>%
summarise(meanSpeed = mean(distance, na.rm = T)/mean(air_time, na.rm = T))
flights %>%
filter(!is.na(tailnum)) %>%
group_by(month, day, tailnum) %>%
summarise(nLegs = n(),
meanDistance = mean(distance, na.rm = T)) %>%
arrange(desc(nLegs))
It looks like distance falls as arrival delay increases, except at LaGuardia. A little weird. The segmentation of origin I did is a bit random.
A-4: Does the relationship hold across the combined airports?
flights %>%
group_by(distance, origin) %>%
summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>%
arrange(distance) %>%
ggplot(aes(x = meandelayQ101, y = distance)) +
geom_point() +
geom_smooth(method = lm)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
It does. A-5: Let’s try minus outlier distances.
flights %>%
group_by(distance, origin) %>%
filter(distance < 3000) %>%
summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>%
arrange(distance) %>%
ggplot(aes(x = distance, y = meandelayQ101)) +
geom_point() +
geom_smooth(method = lm) +
facet_wrap(~ origin)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Result: still a negative relationship between distance and delay. weird. Mean delay is highest for very short flights. A-6: Let’s remove the origin faceting and check one more time.
flights %>%
group_by(distance, origin) %>%
filter(distance < 3000) %>%
summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>%
arrange(distance) %>%
ggplot(aes(x = distance, y = meandelayQ101)) +
geom_point() +
geom_smooth(method = lm)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
There is a trend here.
*SECTION TWO Hypothesis Two: there is a relationship between Precip in weather and Departure Delay. (Spends a full hour understanding joins again) B-1: Let’s make our dataset. Right now, Precip and flights are separated.
weather
Preciptest <- flights %>%
select(time_hour, origin, dep_delay) %>%
left_join(
x = .,
y = weather,
by = c("origin", "time_hour"))
Ok, we have a dataset. Let’s look at precip.
Preciptest
Preciptest %>%
group_by(precip, origin) %>%
summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>%
arrange(meandepdelayQ201) %>%
ggplot(aes(meandepdelayQ201, y = precip)) +
geom_point() +
geom_smooth(method = lm)
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
Looks like a bingo. Let’s wrap it by origin.
Preciptest %>%
group_by(precip, origin) %>%
summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>%
arrange(meandepdelayQ201) %>%
ggplot(aes(meandepdelayQ201, y = precip)) +
geom_point() +
geom_smooth(method = lm) +
facet_wrap(~ origin)
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
What else can we facet wrap?
Preciptest <- flights %>%
select(time_hour, carrier, origin, dep_delay) %>%
left_join(
x = .,
y = weather,
by = c("origin", "time_hour"))
Preciptest %>%
group_by(precip, carrier) %>%
summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>%
arrange(meandepdelayQ201) %>%
ggplot(aes(meandepdelayQ201, y = precip)) +
geom_point() +
geom_smooth(method = lm) +
facet_wrap(~ carrier)
## Warning: Removed 31 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).
Interesting and a bit hard to parse, but the relationship certainly varies by carrier.
#geom_
SECTION THREEE ##Exploring Visib and arrival delay
library(tidyverse)
library(nycflights13)
DataSet4 <- flights %>%
select(time_hour, origin, arr_delay) %>%
left_join(weather, by = c("origin", "time_hour"))
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
str(DataSet4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ year : num 2013 2013 2013 2013 2013 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ temp : num 39 39.9 39 39 39.9 ...
## $ dewp : num 28 25 27 27 25 ...
## $ humid : num 64.4 54.8 61.6 61.6 54.8 ...
## $ wind_dir : num 260 250 260 260 260 260 240 260 260 260 ...
## $ wind_speed: num 12.7 15 15 15 16.1 ...
## $ wind_gust : num NA 21.9 NA NA 23 ...
## $ precip : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pressure : num 1012 1011 1012 1012 1012 ...
## $ visib : num 10 10 10 10 10 10 10 10 10 10 ...
DataSet4 %>%
ggplot(aes(x = visib)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1556 rows containing non-finite values (stat_bin).
DataSet4$visib %>%
sort() %>%
unique()
## [1] 0.00 0.06 0.12 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
## [12] 2.50 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
After Trial and Error, arrive at this, which is ugly and needs improvement
DataSet4 %>%
ggplot(aes(x = visib, y = arr_delay)) +
geom_point() +
geom_smooth(method = lm)
## Warning: Removed 10957 rows containing non-finite values (stat_smooth).
## Warning: Removed 10957 rows containing missing values (geom_point).
So the LM line appears to visually disagree with the superficial impression of the points. Are outliers affecting this? Let’s check sample sizes.
DataSet4 %>%
group_by(visib) %>%
summarise(nObs = n())
Sizes are acceptable. I’m not sure what to think about this relationship. The scale of the graph could be affecting the line. Let’s try a box plot. First, i think I need to purge the NAs. First, how many rows are there before I start.
DataSet4 %>%
nrow()
## [1] 336776
Ok, let’s try something re NA purge.
DataSet4 %>%
filter(!is.na(visib))
This is what I have in mind, but it doesn’t work yet.
DataSet4 %>%
group_by(visib) %>%
ggplot(aes(x = as.factor(visib), y = arr_delay), na.rm = T ) +
geom_boxplot()
## Warning: Removed 9430 rows containing non-finite values (stat_boxplot).
Okay, this kinda works! You could sort of read a trend into this, but it seems to decline around “1”
DataSet4 <- DataSet4 %>%
filter(visib > 0.11)
DataSet4 %>%
ggplot(aes(x = visib, y = arr_delay)) +
geom_point() +
geom_smooth(method = lm)
## Warning: Removed 9369 rows containing non-finite values (stat_smooth).
## Warning: Removed 9369 rows containing missing values (geom_point).
A clearer relationship now, via cheating :-D
Comments
Output fromme: clean up vizskim()is not being displayed because you have setresults = "hide"unique(flights$arr_delay)). Moreover, a bunch of numbers on the screen does give the reader any further insights.cut_*()set of function under ggplot)# {.tabset .tabset-fade .tabset-pills}scale_x_log10())Make things more attractive
Due Date: Sunday, April 28th
Keyboard Shortcuts:
Alt+-for Windows andOption+-for MacCtrl+Alt+Ifor Windows andCommand+Option+Ifor MacCtrl+Enterfor Windows andCommand+Enterfor MacCtrl+Shift+Enterfor Windows andCommand+Shift+Enterfor MacCtrl+Shift+Mfor Windows andCommand+Shift+Mfor MacGeneral Instructions
Task: Explore at least 5 different causes for departure and arrival delays (combined - not 5 each). Its not necessary that you find a definitive answer for what causes a delay, instead I want you to showcase your ability to explore the data with a combination of analytical text, tables and charts. The final output should be something you can show a prospective employer or client as proof of you ability to work with data in R.